Class-Based Weighted NB for Text Categorization
Abstract
The Naïve Bayes classifier is a supervised, probabilistic learning method (Manning, Raghavan, & Schuetze, 2008) that greatly simplifies learning by assuming that the features are conditionally independent given the class. Although this assumption rarely holds in practice, the classifier competes well with more sophisticated techniques (Rish, 2001). Because it is also fast and easy to implement, Naïve Bayes is frequently used for text classification (Rennie, Shih, Teevan, & Karger, 2003). Studies comparing classification algorithms have found Naïve Bayes comparable in performance to decision tree and neural network classifiers (Han & Kamber, 2006).
Many enhancements have been proposed to relax the unrealistic independence assumption, mainly in the areas of feature selection and feature weighting (Lee, Gutierrez, & Dou, 2011). Feature selection is the process of choosing a subset of the available features and using only that subset in text categorization. It has two main advantages: first, by shrinking the effective vocabulary it makes classification more efficient; second, by eliminating noise features it makes classification more accurate (Manning et al., 2008). Feature weighting assigns a weight to each feature and is more flexible than feature selection, because the weights are continuous whereas feature selection in effect assigns only 0/1 values (Lee et al., 2011). Many improvements have been proposed in both areas, but adjusting weights with respect to the class attribute has rarely been investigated.
In this chapter, we propose the class-based weighted Naïve Bayes algorithm, in which weight adjustment is performed for all training samples that share the same class attribute. Weights are adjusted by trying different weights for every feature in the dataset and selecting the weight that yields the greatest improvement in the classification result. This mechanism is elaborated in Section 3. The chapter is structured as follows. The next section briefly reviews other enhancements proposed to improve the Naïve Bayes classifier. Section 3 introduces the proposed class-based weighted Naïve Bayes algorithm and presents the results of our experiments. Finally, directions for future research and a conclusion are given.
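The weight-adjustment idea described in the abstract can be illustrated with a small sketch. This is not the chapter's actual algorithm, whose details are given in its Section 3 and are not reproduced here. It assumes a multinomial Naïve Bayes model, one weight per (class, feature) pair applied as an exponent on the term likelihood, a hypothetical candidate set of weights, and a greedy search that keeps a weight only if accuracy improves. All function names and the toy data are illustrative.

```python
import numpy as np

def fit_multinomial_nb(X, y, n_classes, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log term likelihoods.
    Assumes every class has at least one training document."""
    log_prior = np.zeros(n_classes)
    log_lik = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log(Xc.shape[0] / X.shape[0])
        counts = Xc.sum(axis=0) + alpha
        log_lik[c] = np.log(counts / counts.sum())
    return log_prior, log_lik

def predict(X, log_prior, log_lik, weights):
    # score[d, c] = log P(c) + sum_j weights[c, j] * X[d, j] * log P(t_j | c)
    scores = log_prior + X @ (weights * log_lik).T
    return scores.argmax(axis=1)

def accuracy(X, y, log_prior, log_lik, weights):
    return float((predict(X, log_prior, log_lik, weights) == y).mean())

def tune_class_weights(X, y, log_prior, log_lik, candidates=(0.5, 1.0, 2.0)):
    """Greedy search (an assumption, not the authors' procedure): for each
    class and feature, try each candidate weight and keep it only if the
    accuracy on (X, y) strictly improves."""
    weights = np.ones_like(log_lik)
    best = accuracy(X, y, log_prior, log_lik, weights)
    for c in range(log_lik.shape[0]):
        for j in range(log_lik.shape[1]):
            for w in candidates:
                old = weights[c, j]
                weights[c, j] = w
                acc = accuracy(X, y, log_prior, log_lik, weights)
                if acc > best:
                    best = acc           # keep the improving weight
                else:
                    weights[c, j] = old  # revert
    return weights, best

# Toy term-count matrix: 6 documents, 4 vocabulary terms, 2 classes (made up).
X = np.array([[2, 1, 0, 0],
              [3, 0, 1, 0],
              [1, 2, 0, 1],
              [0, 0, 2, 1],
              [0, 1, 1, 3],
              [1, 0, 2, 2]])
y = np.array([0, 0, 0, 1, 1, 1])

log_prior, log_lik = fit_multinomial_nb(X, y, n_classes=2)
base_acc = accuracy(X, y, log_prior, log_lik, np.ones_like(log_lik))
weights, tuned_acc = tune_class_weights(X, y, log_prior, log_lik)
print(base_acc, tuned_acc)
```

Here the weights are tuned on the training data itself to keep the sketch short. In practice a held-out validation set would be the more cautious choice, since selecting weights by training accuracy can overfit.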
Similar Resources
Based on Weighted Gauss-Newton Neural Network Algorithm for Uneven Forestry Information Text Classification
In order to deal with the problem of low categorization accuracy of minority class of the uneven forestry information text classification algorithm, this paper puts forward the uneven forestry information text classification algorithm based on weighted Gauss-Newton neural network, on the basis of weighted Gauss-Newton algorithm, the algorithm is proved via singular value decomposition principle...
Web Classification Approach Using Reduced Vector Representation Model Based on Html Tags
Automatic web page classification plays an essential role in information retrieval, web mining and web semantics applications. Web pages have special characteristics (such as HTML tags, hyperlinks, etc.) that make their classification different from standard text categorization. Thus, when applied to web data, traditional text classifiers do not usually produce promising results. In this pap...
Automated Arabic Text Categorization Using SVM and NB
Text classification is a supervised learning technique that uses labeled training data to derive a classification system (classifier) and then automatically classifies unlabelled text data using the derived classifier. In this paper, we investigate Naïve Bayesian method (NB) and Support Vector Machine algorithm (SVM) on different Arabic data sets. The bases of our comparison are the most popula...
Multiclass Boosting with Adaptive Group-Based kNN and Its Application in Text Categorization
AdaBoost is an excellent committee-based tool for classification. However, its effectiveness and efficiency in multiclass categorization face the challenges from methods based on support vector machine (SVM), neural networks (NN), naïve Bayes, and k-nearest neighbor (kNN). This paper uses a novel multi-class AdaBoost algorithm to avoid reducing the multi-class classification problem to multiple tw...
A Survey on text categorization of Indian and non-Indian languages using supervised learning techniques
Categorization of text plays an important role in the text mining field. Text categorization is the process in which documents are assigned to their predefined categories. Automatic text categorization is an important task due to the large amount of electronic documents. This paper presents a survey of text categorization of Indian and non-Indian languages. There is very little work done in text cat...